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Abstract 

We propose in this paper an exploratory analysis algorithm for functional data. 
The method partitions a set of functions into K clusters and represents each 
cluster by a simple prototype (e.g., piecewise constant). The total number of 
segments in the prototypes, P, is chosen by the user and optimally distributed 
among the clusters via two dynamic programming algorithms. The practical 
relevance of the method is shown on two real world datasets. 
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1. Introduction 

Functional Data Analysis [55] addresses problems in which the observations 
are described by functions rather than finite dimensional vectors. A well known 
real world example of such data is given by spectrometry in which each object 
is analysed via one spectrum, that is a function which maps wavelengths to 
absorbance values. Online monitoring of hardware is also a good example of 
such data: each object is described by several time series associated to physical 
quantities monitored at specified sampling rates. 

Another application domain is related to electric power consumption curves, 
also called (electric) load curves. Such data describes the electric power con- 
sumption over time of one household or one small or large industry. Load curves 
can be very voluminous since there are many consumers (over 35 millions in 
France) which can be observed during a long period (often several years) and 
at a rate up to one point every minute. Consequently, there is a strong need 
for applying unsupervised learning methods to summarize such datasets of load 
curves. A typical way to do so is to split each curve into daily (or weekly) 
periods and perform a clustering of such daily curves. The motivation of such 
analyses is related to several practical problems: understanding the consumption 
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behaviour of consumers in relation to their equipment or weather conditions, 
defining new prices, optimizing the production of electric power, and in the next 
years monitoring the household consumption to face high peaks. 

We focus in this paper on the exploratory analysis of a set of curves (or time 
series) . The main idea is to provide the analyst with a summary of the set with 
a manageable complexity. A classical solution for multivariate data consists in 
using a prototype based clustering approach: each cluster is summarized by its 
prototype. Standard clustering methods such as K means and Self Organizing 
Map have been adapted to functional data and could be used to implement 
this solution [TJ [SJ [55]. Another possibility comes from the symbolization 
approaches in which a time series is represented by a sequence of symbols. In the 
SAX approach [24] . the time series is transformed into a piecewise representation 
with contiguous time intervals of equal size: the value associated with each 
interval is the average of actual values in the interval. This approach is very 
fast but does not give any guarantee on the associated error. In [15], a piecewise 
constant approximation of a time series is constructed via a segmentation of the 
time domain into contiguous intervals on which the series is represented by its 
average value, which can be turned into a label in a subsequent quantization 
step. When we are given a set of curves, an unique segmentation can be found in 
order to represent all the curves on a common piecewise constant basis (see [28] 
for an optimal solution). This was used as a preprocessing step in e.g. [151150] . 

We propose in this paper to merge the two approaches: we build a K means 
like clustering of a set of functions in which each prototype is given by a simple 
function defined in a piecewise way. The input interval of each prototype is par- 
titioned into sub-intervals (segments) on which the prototype assumes a simple 
form (e.g., constant). Using dynamic programming, we obtain an optimal seg- 
mentation for each prototype while the number of segments used in each cluster 
is also optimally chosen with respect to a user specified total number segments. 
In the case of piecewise constant prototypes, a set of functions is summarized 
via 2P — K real values, where K is the number of prototypes and P the total 
number of segments used to represent the prototypes. 

The rest of this paper is organized as follows. Section [2] introduces the prob- 
lem of finding a simple summary of a single function, links this problem with 
optimal segmentation and provides an overview of the dynamic programming 
solution to optimal segmentation. Section [3] shows how single function sum- 
marizing methods can be combined to clustering methods to give a summary 
of a set of functions. It also introduces the optimal resource allocation algo- 
rithm that computes an optimal number of segments for each cluster. Section 
|4] illustrates the method on two real world datasets. 

2. Summarizing one function via optimal segmentation: a state-of- 
the-art 

The proposed solution is built on two elements that are mostly indepen- 
dent: any standard clustering algorithm that can handle functional data and 
a functional summarizing technique that can build an appropriate low com- 
plexity representation of a set of homogeneous functions. The present section 
describes the summarizing technique for a single function; the extension to a 
set of functions is described in Section [U 
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Building a summary of a function is deeply connected to building an opti- 
mal segmentation of the function, a problem which belongs to the general task 
of function approximation. Optimal segmentation has been studied under dif- 
ferent point of views (and different names) in a large and scattered literature 

(see [1 H E3 [ID HI [53 E2] referenCeS therein )- The § oal of the Present 
section is not to provide new material but rather to give a detailed exposition 
of the relations between summary and segmentation on the one hand, and of 
the optimal segmentation framework itself on the other hand. 

2.1. Simple functions 

We consider a function s sampled in M distinct points {tk)k=i from the 
interval [ti , tjw] (points are assumed to be ordered, i.e., ti < t 2 < ■ ■ ■ < 
Our goal is to approximate s by a simpler function g on [t±, tj^]. The simplicity 
concept targeted in this article is not based on smoothness as in regularization 
[33] or on capacity as in learning theory (see e.g., [§])■ It corresponds rather 
to an informal measure of the simplicity of the visual representation of g for a 
human analyst. 

The difference between those three aspects (smoothness, capacity and visual 
simplicity) can be illustrated by an elementary example: let us assume that g 
is chosen as a linear combination of N fixed functions, i.e., 

N 

ff(t) = (!) 

i=l 

It is well known that the set of indicator functions based on functions of the 
previous form has a VC dimension of N, as long as the functions (4>i)iLi are 
linearly independent (see e.g., [5]). If we consider now the L 2 norm of the 
second derivative of g as a roughness measure, it is quite clear that the actual 
smoothness will depend both on the functions (4>i)fLi and on the values of the 
coefficients (fii)f = i- As an extreme case, one can consider <fii(t) = sign(t) while 
the (<f>i)iL 2 are smooth functions. Then any g with fi\ ^ is non smooth while 
all other functions are. 

The visual complexity of g will also clearly depend on both the basis func- 
tions and the coefficients, but in a way that will be generally quite opposite to 
the smoothness. Let us again consider an extreme case with N = 2, the interval 
[0, 1] and two choices of basis functions, (</>i, 2 ) and (ipi, ipz) defined as follows: 

, , x fl if * < I , , x [O if t < i 



if*>§ U ift>5 



and 



Mt) = exp ^-16 j M*) = cxp ^-8 (j-tj V ( 3 ) 

Obviously, functions generated by (</>i,<^2) are far less smooth than functions 
generated by (ipi,^) (the former are piecewise constant). However, as shown 
on Figure [TJ funct ions generated by (^i,02) are much easier to describe and 
to understand than functions generated by ("01, V^)- Indeed, piecewise constant 
functions admit simple textual description: the function on the left part of 
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Figure 1: Left: function generated by (<f>\,(j>2)\ right: function generated by {ipi,ip2) 



Figure [T] takes values 0.5 for the first half of the period of study and then 
switches to 1 for the last part. In the contrary, smooth functions are inherently 
more complex to describe: the function on the right part of Figure [I] needs a 
long textual description specifying its evolution trough time. In addition, no 
reasonable textual description will give a way to reconstruct exactly the original 
function. 

To summarize, classes of functions with identical capacity (as measured by 
the VC dimension) can contain functions of arbitrary smoothness and arbitrary 
unrelated complexity. We need therefore a specific solution to induce a simple 
approximation of a function. Experimental studies in information visualization 
(see, e.g., [13]) have shown that different tasks (e.g., clustering, change detection, 
etc.) have different type of adapted visual features. In our context, we target 
applications in which a textual (written or spoken) description of behavior of 
the function is useful. For instance in mass spectrometry, peaks positions are 
very informative and correspond to descriptions such as "there is a peak of this 
height at that position" . In the case of electrical consumption, the mean level 
consumption during a time frame is an interesting aspect and corresponds to 
description such as "the outlet consumes this amount of power from time a to 
time 6" . Hardware monitoring provides another example in which quantities of 
interest are generally constant on some time frames and switch linearly from 
one stable value to another. 

To implement this idea, we rely on a segmentation approach detailed in the 
next section (and covered in a large body of literature, as recalled in the intro- 
duction). A simple function in our context is given by a partition of [tiji/w] 
into a small number of segments on which the function takes a simple applica- 
tion dependent parametric form. If the parametric form is reduced to a single 
value, the simple function is piecewise constant and is easy to describe in text. 
Other convenient parametric forms are affine functions (to provide less jumpy 
transitions) and peaks. This approach is closely related to (and inspired by) 
symbolization techniques used for instance in time series indexing [15] : the idea 
is to build a piecewise constant approximation of a function and then to quantize 
the levels (the constants) into a small number of values. The obtained values 
are replaced by symbols (e.g., letters) and the time series is therefore translated 
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into a sequence of symbols. 
2.2. Segmentation 

Let us analyze first an example of the general scheme described in the previ- 
ous section. We consider a piecewise constant approximation given by P inter- 
vals (I p )p = i and the corresponding P values (a p )p =1 (we assume that (J p )^Lj 
is a partition of [ti, £m])- The approximating function g is defined by g(t) = a p 
when t G I p . If we consider the L 2 error criterion, we obtain a particular case 
of the Stone approximation problem [32 as we have to minimize 

£ (s, (IX =1 , K)£ =1 ) = / (s(t) - g(t)fdt = / (s(t) - a p fdt. (4) 
Jti p J i p 

The integral can be approximated by a quadrature scheme. We consider the 
simplest setting in which a summation over {tk)^ =1 provides a sufficient approx- 
imation, but the generalization to weighted schemes such as [23] is straightfor- 
ward. In practice, we optimize therefore 

E (s, {Ip)^ x , {a p ) p p=1 ) =J2J2 - a vf- (5) 

P tk£l P 

The difficulty lies in the segmentation, i.e., in the choice of the partition [Ip)p—i, 
as, given this partition, the optimal values of the (a p )^ =1 are equal to the fi p = 
TFT EtfcGi s (tk), where \I p \ denotes the number of tk in I p . 

More generally, the proposed method is built upon a parametric approxima- 
tion on each segment. Let us denote Q(s,I) the minimal error made by this 
parametric approximation of s on interval /. The piecewise constant approxi- 
mation described above corresponds to 

Q( s ,/)= W s (t) * . (6) 

tei \ 11 vei ) 

Similar expressions can be derived for a piecewise linear approximation or any 
other reasonable solution. For instance 

Q( S ,J) = 5>(t)-m( S ,J)|, (7) 
tei 

where m(s,/) is the median of {s(tk)\tk € /}, provides a more robust solution. 
Given Q, the optimal segmentation of s is obtained by minimizing 

p 

over the partitions (I p )p =l of [ti,tM]- This formulation emphasizes the central 
hypothesis of the method: the optimal approximation on a segment depends 
only on the values taken by s in the segment. While a more general segmentation 
problem, in which this constraint is removed, can be easily defined, its resolution 
is much more computationally intensive than the one targeted here. 
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2.3. Optimal segmentation 

Bellman showed in [3] that the Stone approximation problem |32j problem 
can be solved exactly and efficiently by dynamic programming: this is a conse- 
quence of the independence between the approximation used in each segment. 
This has been rediscovered and/or generalized in numerous occasions (see e.g., 

0HSII25I). 

A different point of view is given in \21\ I22j . As the function s is known only 
through its values at (ifc)fcLii defining a partition of [£i,£m] m to P intervals 
{Ip)p=i is n °t needed. A partition (C p ) p=1 of t\ < £2 < • • • < £m is sufficient, as 
long as it is ordered: if tk G C v and ti € C p , then {ffc,tfc+i, ...,£;} C C p . The 
goal of the segmentation is then to find an optimal ordered partition of {tk)k=\ 
according to the error criterion defined in equation ([5]) (where I p is replaced by 
C p and Q is modified accordingly). As shown in [211 |2"2"] . this problem can be 
solved by dynamic programming as long as the error criterion is additive, i.e., 
of the general form 

E{ S ,(C p )? =1 )=Y,Q(s,C p ). (9) 

p 

Moreover, the summation operator can be replaced by any commutative 
aggregation operator, e.g., the maximum. As already mentioned in the previous 
section, the crucial point is that the error criterion is a combination of values 
obtained independently on each cluster. 

The dynamic programming algorithm for minimizing the cost of equation 
Q is obtained as follows. We define first a set of clustering problems: 

F(s,k,j) = min £)Q(«,C P ). (10) 

(Cp)p=l> ordered partition of {t k ,t k+1 ,...,t M } 

The basic idea is to compute F(s, k,j) using F(s, ., j — 1), as the best partition 
into j clusters is obtained with 

F(s,k,j)= min Q(s,{t k ,...,ti}) + F(s,l + l,j-l). (11) 

k<l<M-j+l 

Indeed the partition is ordered and therefore, up to a renumbering of the clus- 
ters, there is I such that C\ = {t%, . . . , £;}. Moreover, the quality measure is 
additive and therefore finding the best partition in j clusters with the constraint 
that Ci = {ti, . . . , ti} corresponds to finding the best partition of {£/+i, . . . , £jw} 
in j — 1 clusters. This leads to Algorithm[T] in which W(k, j) is the winner split, 



i.e., the I that realizes the minimum in equation (111. The final loop is the back- 
tracking phase in which the split positions are assembled to produce the result 
of the algorithm, C, which gives the last indexes of the P — 1 first clusters of 
the partition. It should be noted that while Algorithm [T] outputs only the opti- 
mal partition in P clusters, an actual implementation will of course provide in 
addition the optimal model. Moreover, as the algorithm produces all values of 
F(., .), it can provide at once all optimal partitions in p clusters for p ranging 
from 1 to P. This is done via P backtracking loops for a total cost in 0(P 2 ). 

Figures [2] and [3] show an example of the results obtained by the algorithm. 
The original function on the left side of Figure [2] is a spectrum from the Tecator 



6 



Algorithm 1 Dynamic programming summarizing of a single function 
l: for k = 1 to M do 

2: F(s,fc, 1) <- Q(s,{t k ,t k+ i,.. .,t M }) 

3: W^(fe, 1) <— NA {no meaningful value at for j = 1} 

4: end for 

5: for j = 2 to P do 

6: for = 1 to M — j + 1 do 

7: k, j) <- Q(s, {t k }) + F(s, k + l,j-l) 

8: W(k,l)<-k 

9: for I = k + 1 to M - j + 1 do 

10: if Q(s,{t k ,..., t t }) + F(s, l + lj- 1) <F(s,k,j) then 

11: F(s, <- Q({«, t fc) . . . , ti}) + F(s, I + 

12: W(k,l)<-1 

13: end if 

14: end for 

15: end for 

16: end for 

17: C<- (NA,...,NA) 

18: C(P- 1) <- W(1,P) 

19: for j = P - 2 to 2 do 

20: C(i)<" W{C{j + l) + 1,3 + 1) 

21: end for 



dataselQ for which Af = 100. The spectrum is segmented into 1 to 10 segments 
(positions of the optimal segments are given on the right side of Figure [2]). The 
resulting approximations for 4 and 6 segments are given on Figure [3] 




850 900 950 1000 1050 850 900 950 1000 1050 



Figure 2: Segmentation of a single function: original function on the left, segment positions 
on the right 



2.4- Algorithmic cost 

Given all the Q (s, {t k , . . Algorithm [l] runs in 0(PM 2 ). This is far 

more efficient that a naive approach in which all possible ordered partitions 



Data are available on statlib at http://lib.stat.cmu.edu/datasets/tecator and consist 
in near infrared absorbance spectrum of meat samples recorded on a Tecator Infratec Food 
and Feed Analyzer. 
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850 900 950 1000 1050 350 900 950 1000 1050 

wavelength wave length 

Figure 3: Piecewise constant approximations with 4 segments on the left and 6 segments on 
the right 

would be evaluated (there are rsprW] such partitions). However, an efficient 
implementation is linked to a fast evaluation of the Q (s, {tk, ■ ■ ■ , U}). For in- 
stance, a naive implementation based on equation ^ leads to a 0(M 3 ) cost 
that dominates the cost of Algorithm [T] 

Fortunately, recursive formulation can be used to reduce in some cases the 
computation cost associated to the Q (s,{tk, ■ ■ ■ ,ti}) to 0(M 2 ). As an illus- 
tration, Algorithm [2] computes Q for equation ^ (piecewise constant approx- 
imation) in 0(M 2 ). Similar algorithms can be derived for other choices of the 



Algorithm 2 Recursive calculation of Q (s, {tk, ■ ■ ■ , tj}) 

1: for fc = 1 to M do 

2: fl(s, {t k }) <- S(t k ) 

3: Q(s,{t k })^0 

4: end for 

5: for I = 2 to M do 

6: fa,...,*,}) <- \ [{I- {«!,...,«!_!})+«(«,)] 

7: Q(a, {*!,..., t,}) <- Q( S , {*!,..., + ^ [«(tj) - {*!,..., U})f 

8: end for 

9: for k = 2 to M 1 do 

10: for I = k + 1 to M do 

ll: /z(s,{^,...,t ; }) <- 7= ip T [(i-fc + 2)/i(s,{f fe _ 1 ,...,ii})-s(i fe _i)] 

12: Q(« s {t fc ,...,ti}) <- Q(a, {t fc _i,...,t,}) |Es$|[a(tfc-i) - 

/x(s,{^.,...,i ; })] 2 

13: end for 

14: end for 



approximation solution used in each segment. The memory usage can also be 
reduced from 0{M 2 ) in Algorithms [l] and [2] to O(M) (see, e.g., |2"T1 12"2] ) . 

2.5. Extensions and variations 

As mentioned before, the general framework summarized in equation ^ is 
very flexible and accommodates numerous variations. Those variations include 
the approximation model (constant, linear, peak, etc.), the quality criterion 
(quadratic, absolute value, robust Huber loss [2], etc.) and the aggregation 
operator (sum or maximum of the point wise errors). 
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Figure [4] gives an application example: the spectrum from Figure [2] is ap- 
proximated via a piecewise linear representation (on the right hand side of the 
Figure): as expected, the piecewise linear approximation is more accurate than 
the piecewise constant one, while the former uses less numerical parameters 
than the latter. Indeed, the piecewise linear approximation is easier to describe 




wave length wavelength 



Figure 4: Segmentation of a single function: piecewise constant with 10 segments (left) versus 
piecewise linear with 5 segments (right) 

than the piecewise constant approximation: one needs only to specify the 4 
breakpoints between the segments together with 10 extremal values, while the 
piecewise constant approximation needs 10 numerical values together with 9 
breakpoints. 

However, this example is particularly favorable for the piecewise linear ap- 
proximation. Indeed, the independence hypothesis embedded in equation Q 
and needed for the dynamic programming approach prevents us from intro- 
ducing non local constraints in the approximating function. In particular, one 
cannot request for the piecewise linear approximation to be continuous. This 
is emphasized on Figure [4] by the disconnected drawing of the approximation 
function: in fact, the function g is not defined between tk and tk+i if those two 
points belong to distinct clusters (in the case of Bellman's solution [3J, the func- 
tion is not well defined at the boundaries of the intervals). Of course, one can 
use linear interpolation to connect g(tk) with g(tk+i), but this can introduce 
in practice P — 1 additional very short segments, leading to an overly complex 
model. 

In the example illustrated on Figure |4j the piecewise approximation does 
not suffer from large jumps between segments. Would that be the case, one 
could use an enriched description of the function specifying the starting and 
ending values on each segment. This would still be simpler than the piecewise 
constant description. However, if the original function is continuous, using a 
non continuous approximation can lead to false conclusion. 

To illustrate the flexibility of the proposed framework beyond the simple 
variations listed above, we derive a solution to this continuity problem [55]. Let 
us first define 

<*,(<* '<» - 1 (»ft) - (s(tt) - 3( "»' t + _y ' - " ,(tt)) )' ■ («) 

This corresponds to the total quadratic error made by the linear interpolation of 
s on {tk, ■ ■ ■ , t{\ based on the two interpolation points (tk, s(tk)) and (ti, s(i/)). 
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Then we replace the quality criterion of equation ([9| by the following one: 

p 

E((k l )f =0 ) = ^Q(s,{t h _ 1 , ...,**,}), (13) 
i=i 

defined for an increasing series (fc/)/L with fco = 1 and fcp = M. Obviously, 
E((ki)f =Q ) is the total quadratic error made by the piecewise linear interpolation 
of s on [ti,t M ] based on the P + l interpolation points (i fci , s(ife ; ))fL . While 
(ki)f =0 does not correspond strictly to a partition of {ti, . . . , £Af } (because each 
interpolation point belongs to two segments, expect for the first and the last 
ones), the criterion E is additive and has to be optimized on an ordered struc- 
ture. As such, it can be minimized by a slightly modified version of Algorithm 
[T] Figure [5] shows the differences between the piecewise linear approach and the 




wavelength wave length 



Figure 5: Segmentation of a single function: piecewise linear with 5 segments (left) versus 
linear interpolation with 5 segments (right) 

linear interpolation approach. For continuous functions, the latter seems to be 
more adequate than the former: it does not introduce arbitrary discontinuity 
in the approximating function and provides a simple summary of the original 
function. 

The solutions explored in this section target specifically the summarizing 
issue by providing simple approximating function. As explained in Section [2T| 
this rules out smooth approximation. In other applications context not ad- 
dressed here, one can trade simplicity for smoothness. In [4 , a model based 
approach is used to build a piecewise polynomial approximation of a function. 
The algorithm provides a segmentation useful for instance to identify underlying 
regime switching as well as a smooth approximation of the original noisy sig- 
nal. While the method and framework are related, they oppose on the favored 
quality: one favors smoothness, the other favors simplicity. 

3. Summarizing several functions 

Let us now consider N functions (si)fL 1 sampled in M distinct points (tk)^Li 
from the interval [t\, tja\ (as in Section|2j the points are assumed to be ordered). 
Our goal is to leverage the function summarizing framework presented in Section 
[2] to build a summary of the whole set of functions. We first address the case of 
a homogeneous set of functions and tackle after the general case. 
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3.1. Single summary 

Homogeneity is assumed with respect to the chosen functional metric, i.e., 
to the error criterion used for segmentation in Section [2] for instance the L 2 ■ 
Then a set of functions is homogeneous if its diameter is small compared to the 
typical variations of a function from the set. In the quadratic case, we request 
for instance to the maximal quadratic distance between two functions of the set 
to be small compared to the variances of functions of the set around their mean 
values. 

Then, if the functions (si)f =1 are assumed to be homogeneous, finding a 
single summary function seems natural. In practice, this corresponds to finding 
a simple function g that is close to each of the s, as measured by a functional 
norm: one should choose a summary type (e.g., piecewise constant), a norm 
(e.g., L2) and a way to aggregate the individual comparison of each function to 
the summary (e.g., the sum of the norms). 

Let us first consider the case of a piecewise constant function g defined by P 
intervals (I p ) p= i (a partition of [ii,i&r]) and P values (a p )p =1 (with g(t) = a p 
when t G I p ). If we measure the distance between g and via the L 2 distances 
and consider the sum of L 2 distances as the target error measure, we obtain the 
following segmentation error: 

N P 

s (a,,)^) =EE E -a*) 2 , (14) 

i=l p=l t k el p 



using the same quadrature scheme as in Section [272] Huygens' theorem gives 

N N 

(si(tk) — Op) =N(n(tk)-a p ) + y J (sj(tk) - (Jtjtk)) , (15) 



i=l i=l 



where fj, = -h 5TJi=i s i i s ^he mean function. Therefore minimizing E from 



equation ( 14 ) is equivalent to minimizing 

p 

E' ((*)<=!, (op)^) -EE - «p) 2 • (16) 

p=l tfcG/p 

In other words, the problem consists simply in building an optimal summary 
of the mean curve of the set and is solved readily with Algorithm [T] The 
additional computational cost induced by the calculation of the mean function 
is in O(NAl). This will generally be negligible compared to the 0(PM 2 ) cost 
of the dynamic programming. 

The reasoning above applies more generally to piecewise models used with 
the sum of L 2 distances. Indeed if g(t) is such that g(t) = g p (t) when t G I p , 
then we have 

N N 

E ( s <(^) - 9 P (tk)f = N Oi(t fc ) - 9 p(t k )f + (*i(*fc) - V(tk)f , (17) 

i=i i=i 

which leads to the same solution: g can be obtained as the summary of the 
mean curve. The case of linear interpolation proposed in Section |2.5| is a bit 
different in the sense that interpolating several curves at the same time is not 
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Figure 6: Group summary via an linear interpolation of the mean; the grey dashed/dotted 
curves are the original ones, the grey solid curve is the mean curve and the black curve is the 
piecewise linear interpolation curve that provides the group summary 



possible. Therefore, the natural way to extend the proposal to a set of curves 
it to interpolate the mean curve. An example of the result obtained by this 
method is given in Figure [6] 

In the general case, we assume given an error criterion Q that applies to 
a set of contiguous evaluation points {£&,...,£;}: it gives the error done by 
the local model on {tk, ...,£;} as an approximation of all the functions (si)^L 1 . 
Then given an ordered partition of {ii, . . . (Gp)p=i> we use the sum of 

the Q{{si)f =1: C p ) as a global error measure and obtain an additive form as in 
equation Algorithm [I] applies straightforwardly (it applies also to taking, 
e.g., the maximum of the Q((si)^L 1) C p ) over p). 

The only difficulty compared to the case of a unique function is the efficient 
calculation of the Q((si)^L 1 , {ifc, . . . , £/}). As shown above, the L2 case is very 
favorable as it amounts to summarizing the mean curve. However, more complex 
norms and/or aggregation rules lead to difficulties. If we use for instance 

l 

{tk, ■ ■ = min max V(s 4 (£ 3 ) - a) 2 , (18) 

a Ki<N ^— ' 
- - j = k 

the problem cannot be formulated as a single curve summarizing problem, while 
is consists simply in considering the maximal L2 distance rather than the sum 
of the Li distances: if the curves have different variances on . . . the 
optimal value a will not be the mean on the set of evaluation points of the 
mean curve. More generally, the computation of all Q((si)fL 1 , {tk, ■ ■ ■ , U}) might 
scale much faster than in 0(M(M + N)) and thus might dominate the total 
computation time. Indeed, if there is no closed form expression for Q, then the 
value has to be computed from scratch for each pair (fc, Z), which implies 0(M 2 ) 
problem solving. Then, if no special structure or preprocessing can be used to 
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simplify the problem, computing one value of Q implies clearly to look at all the 
involved functions, i.e, to O(N) observations. Therefore, if no simplification can 
be done, computing all values of Q will imply at least at 0(NM 2 ) calculation 
cost. Care must therefore be exercised when choosing the quality criterion 
and the summarizing model to avoid excessive costs. That said, the dynamic 
programming approach applies and leads to an optimal segmentation according 
to the chosen criterion. 

3.2. Clustering 

Assuming that a set of functions is homogeneous is too strong in practice, 
as shown on Figure [6j some of the curves have a small bump around 940 nm 
but the bump is very small in the mean curve. As a consequence, the summary 
misses this feature while the single curve summary used in Figure [5] was clearly 
picking up the bump. 

It is therefore natural to rely on a clustering strategy to produce homoge- 
neous clusters of curves and then to apply the method described in the previous 
section to summarize each cluster. However, the direct application of this strat- 
egy might lead to suboptimal solutions: at better solution should be obtained 
by optimizing at once the clusters and the associated summaries. We derive 
such an algorithm in the present section. 

In a similar way to the previous section, we assume given an error measure 
Q(G, {tfe, . . . ,ti}) (where G is a subset of {1, . . . , N}): it measures the error 
made by the local model on the {ifc, . . . , t{\ as an approximation of the functions 
Si for i e G. For instance 

i 

Q(G, {t k , t,}) = nun ~ a )'' ( 19 ) 

i£G j=k 

for a constant local model evaluated with respect to the sum of distances. Let 
us assume for now that each cluster is summarized via a segmentation scheme 
that uses a cluster independent number of segments, P/K (this value is assumed 
to be an integer). Then, given a partition of {1, ... , N} into K clusters (Gk)% = i 
and given for each cluster an ordered partition of {ti, . . . ,t M }, , the 

global error made by the summarizing scheme is 

/ k \ K P/K 

E (^ 1 ,(G fc )f =1 ,((C p fe )^f) fe=i U^^Q(G fc ,C )D fe ). (20) 

^ ' k=l p=l 

Building an optimal summary of the set of curves amounts to minimizing this 
global error with respect to the function partition and, inside each cluster, with 
respect to the segmentation. 

A natural way to tackle the problem is to apply an alternating minimiza- 
tion scheme (as used for instance in the K- means algorithm): we optimize suc- 
cessively the partition given the segmentation and the segmentation given the 
partition. Given the function partition (Gk) k=1 , the error is a sum of indepen- 
dent terms: one can apply dynamic programming to build an optimal summary 
of each of the obtained clusters, as shown in the previous section. Optimizing 
with respect to the partition given the segmentation is a bit more complex as 
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the feasibility of the solution depends on the actual choice of Q. In general how- 
ever, Q is built by aggregating via the maximum or the sum operators distances 
between the functions assigned to a cluster and the corresponding summary. 
It is therefore safe to assume that assigning a function to the cluster with the 
closest summary in term of the distance used to define Q will give the optimal 
partition given the summaries. This leads to Algorithm [3j The computational 

Algorithm 3 Simultaneous clustering and segmentation algorithm (uniform 
case) 

1: initialize the partition (Gfc){Lj randomly 
2: repeat 

3: for k = 1 to K do 

4: apply Algorithm [T] to the find the optimal summary of the set of 

curves {si\i € Gk} in P/K segments 
5: end for 
6: for i = 1 to N do 

7: assign a* to Gk such that the distance between a* and gk is minimal 

over k 
8: end for 
9: until (G / t)^L 1 is stable 



cost depends strongly on the availability of an efficient calculation scheme for 
Q. If this is the case, then the cost of the K dynamic programming applications 
grows in 0(K(P/K)M 2 ) = 0(PM 2 ). With standard L p distances between 
functions, the cost of the assignment phase is in O(NKM). Computing the 
mean function in each cluster (or similar quantities such as the median function 
that might be needed to compute efficiently Q) has a negligible cost compared 
to the other costs (e.g., O(NM) for the mean functions). 

Convergence of Algorithm [3] is obvious as long as both the assignment phase 
(i.e., the minimization with respect to the partition given the segmentation) and 
the dynamic programming algorithm use deterministic tie breaking rules and 
lead to unique solutions. While the issue of ties between prototypes is generally 
not relevant in the K-means algorithm, for instance, it plays an important role in 
our context. Indeed, the summary of a cluster can be very simple and therefore, 
several summaries might have identical distance to a given curve. Then if ties 
are broken at random, the algorithm might never stabilize. Algorithm [T] is 
formulated to break ties deterministically by favoring short segments at the 
beginning of the segmentation (this is implied by the strict inequality test used 
at line 10 of the Algorithm). Then the optimal segmentation is unique given 
the partition, and the optimal partition is unique given the segmentation. As a 
consequence, the alternative minimization scheme converges. 

Figure [7] gives an example of the results obtained by the algorithm in the case 
of a piecewise linear summary with five segments per cluster (the full Tecator 
dataset contains N = 240 spectra with M — 100 for each spectrum). 

In practice, we will frequently use a piecewise constant model whose quality 
is assessed via the sum of the L 2 distances, as in equation (19). Then equation 
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Figure 7: Summary of the full Tecator dataset with 6 clusters and 5 segments per cluster: 
grey curves are the original spectra, dark curves are the summaries for each cluster 



(20) can be instantiated as follows: 



^ ' k=neG k p=i t,-ec£ 



if P/-RT 



(21) 



ith 



Mfe.p = jg-^r E E (22) 

Let us denote Sj = (sj(ti), . . . , Si(ijVf)) the M M vector representation of the 
sampled functions and gk = (fl%(ti), • • • ,<7fe(ijw)) the vector representation of 
the piecewise constant functions defined by gk(U) — Mfc, P when ti € C*. Then 



equation (21) can be rewritten 

K 

E (( Si )£x, (G fe )f =1 , ((<7*)£ =1 )* =1 ) = E E II s * - ^f- ( 23 ) 

fc=i ieGfc 

This formulation emphasizes the link between the proposed approach and the 
K means algorithm. Indeed the criterion is the one optimized by the K means 
algorithm for the vector data with a constraint on the prototypes: rather 
than allowing arbitrary prototypes as in the standard K means, we use simple 
prototypes in the sense of Section |2.1| Any piecewise model whose quality is 
assessed via the sum of the L2 distances leads to the same formulation, but with 
different constraints on the shape of the functional prototypes. 



Equation ( 23 1 opens the way to many variations on the same theme: most of 



the prototype based clustering methods can be modified to embed constraints 
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on the prototypes, as Algorithm [3] does for the K means. Interesting candidates 
for such a generalization includes the Self Organizing Map in its batch version 
[18], Neural Gas also in its batch version [7] and deterministic annealing based 
clustering [27], among others. 

3.3. Optimal resources allocation 

A drawback of Algorithm[3]is that it does not allocate resources in an optimal 
way: we assumed that each summary will use P/K segments, regardless of the 
cluster. Fortunately, dynamic programming can be used again to remove this 
constraint. Let us consider indeed a resource assignment, that is K positive 
integers {Pk)k=i such that X^fcLi Pk = P- The error made by the summarizing 
scheme when cluster Gk uses Pk segments for its summary is given by 

K x K P k 



k=l p= 



As for the criterion given by equation ( 20 ) in the previous section, we rely on 
an alternating minimization scheme. Given the function partition (Gk)^ =1 , the 
challenge is to optimize E with respect to both the resource assignment and the 
segmentation. For a fixed partition (Gk)F_ 1: let us denote 



R k (P k )= p min £)Q(G fc> C£) (25) 

(Cp) P =l ordered partition of {ti ,. . . ,t M } „ = j 



Then minimizing E from equation (24) for a fixed partition (Gfe)|L 1 corresponds 
to minimizing: 

K 

U ((P k )^ =1 ) =Y J MPk) (26) 
fe=i 

with respect to the resource assignment. This formulation emphasizes two major 
points. 



Firstly, as already mentioned in Section 2.3 given a number of segments P 
Algorithm [l] and its variants provide with a negligible additional cost all the 
optimal segmentations in p = 1 to P segments. Therefore, the calculation of 
Rk{p) for p € {1, . . • , P} can be done with a variant of Algorithm [l] in 0{PM 2 ) 
provided an efficient calculation of Q(Gk, C*) is possible (in fact, one needs only 
values of Rk(p) for p < P — K + l as we request at least one segment per cluster, 
but to ease the calculation we assume that we compute values up to P). 



Secondly, U as defined in equation ( 26 1 is an additive measure and can 



therefore be optimized by dynamic programming. Indeed let us define S(l,p) as 

l 

S(l,p)= min Vi? fc (.P fe ) (27) 

(P k )' k=1 such that J2' k = 1 P k =p]Z? 1 

The S(l,p) are readily obtained from the optimal segmentation algorithm 



(5(1, p) — Ri(p)). Then the additive structure of equation (26) shows that 



S(l,p)= min S(l-l,p-u)+Ri(u). (28) 

l<u<p— Z + l 
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Algorithm 4 Simultaneous clustering and segmentation algorithm with optimal 
resource assignment 

l: initialize the partition {Gk)^ =1 randomly 
2: repeat 



3: for k = 1 to K do 

4: apply Algorithm [I] to compute Rk(p) for all p € {1, P — K + 1} 

5: end for 

6: for p = 1 to _P — K + l do {initialisation of the resource allocation phase} 

7: S(l,p)<-fli(p) 

8: end for 

9: for I = 2 to K do {dynamic programming for resource allocation} 

10: for p = I to P do 

11: W(l,p)<-1 

12: S(l,p)^S(l-l, P -l)+Rl(l) 

13: for u — 2 to p— Z + ldo 

14: if S(Z -l,p-u) + Ri{u) < S(l,p) then 

15: W(l,p) 4- U 

16: S(l,p) <- S(l- l,p-u) +Rl(ll) 

17: end if 

18: end for 

19: end for 

20: end for 

21: P opt 4 — (NA, . . . , NA) {backtracking phase} 

22: P opt (K) 4- W{K,P) 

23: availabe 4- P - W(K, P) 

24: for k = K — 1 to 1 do 

25: P op t(k) 4- W(k, available) 

26: availabe 4— availabe — W(k, available) 

27: end for 

28: for k = 1 to K do {summary calculation} 

29: build gk with P opt {k) segments 

30: end for 

31: for i = 1 to N do 

32: assign Sj to Gk such that the distance between Sj and is minimal 

over A; 

33: end for 



34: until (G fe )^ 1 is stable 
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Given all the Ri(p) the calculation of S(K,P) has therefore a cost of 0(KP 2 ). 
This calculation is plugged into Algorithm [3] to obtain Algorithm [4] 

Given the values of Q, the total computational cost of the optimal resource 
assignment and of the optimal segmentation is therefore in 0(KP{M 2 + P)). 
As we target simple descriptions, it seems reasonable to assume that P <C 
M 2 and therefore that the cost is dominated by 0(KPM 2 ). Then, using the 
optimal resource assignment multiplies by K the computational cost compared 
to a uniform use of P/K segments per cluster (which computational cost is in 
0(PM 2 )). The main source of additional computation is the fact that one has 
to study summaries using P segments for each cluster rather than only P/K 
in the uniform case. This cost can be reduced by introducing an arbitrary 
(user chosen) maximal number of segments in each clusters, e.g., XP/K. Then 
the running time of the optimal assignment is A times the running time of the 
uniform solution. As we aim at summarizing the data, using a small A such 2 
seems reasonable. 

The computation of the values of Q remains the same regardless of the way 
the number of segments is chosen for each cluster. The optimisation of the 
error with respect to the partition (Gk)^ =1 follows also the same algorithm in 
both cases. If we use for instance piecewise constant approximation evaluated 
with the sum of the L2 distances, then, as already mentioned previously, the 
summaries are computed from the cluster means: computing the means costs 
O(NM), then the calculation of Q for all functional clusters is in 0(KM 2 ) 
and the optimization with respect to (Gk)^ =1 is in O(NKM). In general, the 
dominating cost will therefore scale in 0(KPM 2 ). However, as pointed out 
in Section |3.1[ complex choices for the Q error criterion may induce larger 
computational costs. 

3-4- Two phases approaches 

The main motivation of the introduction of Algorithm [3] was to avoid rely- 
ing on a simpler but possibly suboptimal dual phases solution. This alternative 
solution simply consists applying a standard K means to the vector representa- 
tion of the functions (si)fL 1 to get homogeneous clusters and then to summarize 
each cluster via a simple prototype. While this approach should intuitively give 



higher errors for the criterion of equation ( 20 ) , the actual situation is more 
complex. 

Algorithms [3] and [4] are both optimal in their segmentation phase and will 
generally be also optimal for the partition determination phase if Q is a standard 
criterion. However, the alternating optimization scheme itself is not optimal (the 
K means algorithm is also non optimal for the same reason) . We have therefore 
no guarantee on the quality of the local optimum reached by those algorithms, 
exactly as in the case of the K means. In addition, even if the two phase 
approach is initialized with the same random partition as, e.g., Algorithm[3j the 
final partitions can be distinct because of the distinct prototypes used during 
the iterations. It is therefore difficult to predict which approach is more likely to 
give the best solution. In addition a third possibility should be considered: the K 
means algorithm is applied to the vector representation of the functions (si)fL 1 
to get homogeneous clusters and is followed by an application of Algorithm [3] 
or [4] using the obtained clusters for its initial configuration. 

To assess the differences between these three possibilities, we conducted a 
simple experiment: using the full Tecator dataset, we compared the value of 



18 



E from equation (20) for the final summary obtained by the three solutions 
using the same random starting partitions with K = 6, P = 30 and piecewise 
constant summaries. We used 50 random initial configurations and compared 
the methods in the case of uniform resource allocation (5 segments per cluster) 
and in the case of optimal resource allocation. 

In the case of uniform resources assignment the picture is rather simple. For 
all the 50 random initialisations both two phases algorithms gave exactly the 
same results. In 44 cases, using Algorithm [3] gave exactly the same results as 
the two phases algorithms. In 5 cases, its results were of lower quality and in 
the last case, its results were slightly better. The three algorithms manage to 
reach the same overall optimum (for which E = 472). 

The case of optimal resources assignment is a bit more complex. For 15 
out of 50 initial configurations, using Algorithm [4] after the K means improves 
the results over a simple optimal summary (with optimal resource assignment) 
applied to the results of the K means (results were identical for the 35 other 
cases). In 34 cases, K means followed by Algorithm [4] gave the same results as 
Algorithm [4] alone. In the 16 remaining cases, results are in favor of K means 
followed by Algorithm[4]which wins 13 times. Nevertheless, the three algorithms 
manage to reach the same overall optimum (for which E — 467). 

While there is no clear winner in this (limited scope) experimental analysis, 
considering the computational cost of the three algorithms helps choosing the 
most appropriate one in practice. In favorable cases, the cost of an iteration of 
Algorithm [I] is 0(KMN+KPM 2 )) while the cost of an iteration of the K means 
on the same dataset is O(KMN). In functional data, M is frequently of the 
same order as N. For summarizing purposes, P should remain small, around 5K 
for instance. This means that one can perform several iterations of the standard 
K means for the same computational price than one iteration of Algorithm [4] 
For instance, the Tecator dataset values, P = 30, M = 100 and N = 240, give 
1 + PM/N = 13.5. In the experiments described above, the median number 
of iterations for the K means was 15 while it was respectively 15 and 16 for 
Algorithms [3] and [2] We can safely assume therefore that Algorithm [3] is at 
least 10 times slower than the K means on the Tecator dataset. However, when 
initialized with the K means solution, Algorithms [3] and [4] converges very quickly, 
generally in two iterations (up to five for the optimal assignment algorithm in 
the Tecator dataset). In more complex situations, as the ones studied in Section 
[4] the number of iterations is larger but remains small compared to what would 
be needed when starting from a random configuration. 

Based on the previous analysis, we consider that using the K means algo- 
rithm followed by Algorithm [4] is the best solution. If enough computing re- 
sources are available, one can run the K means algorithm from several starting 
configurations, keep the best final partition and use it as an initialization point 
for Algorithm |4| As the K means provides a good starting point to this latter 
algorithm, the number of iterations remains small and the overall complexity 
is acceptable, especially compared to using Algorithm [4] alone from different 
random initial configurations. Alternatively, the analyst can rely on more ro- 
bust clustering scheme such as Neural Gas [7] and deterministic annealing based 
clustering [27] . or on the Self Organizing Map [18 to provide the starting con- 
figuration of Algorithm |4j 
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4. Experimental results 



We give in this section two examples of the type of exploratory analyses that 
can be performed on real world datasets with the proposed method. In order to 
provide the most readable display for the user, we build the initial configuration 
of Algorithm|4]with a batch Self Organizing Map. We optimize the initial radius 
of the neighborhood influence in the SOM with the help of the Kaski and Lagus 
topology preservation criterion |17j (the radius is chosen among 20 radii). 

4-1. Topex/Poseidon satellite 

We study first the Topex/Poseidon satellite datasel]^] The Topex/Poseidon 
radar satellite has been used over the Amazonian basin to produce N = 472 
waveforms sampled at M — 70 points. The curves exhibit a quite large vari- 
ability induced by differences in ground type below the satellite during data 
acquisition. Details on the dataset can be found in e.g. [HJ [12] . Figure [8] 
displays 20 curves from the dataset chosen randomly. 




Figure 8: 20 curves from the Topex/Poseidon radar satellite dataset 

We conduct first a hierarchical clustering (based on Euclidean distance be- 
tween the functions and the Ward criterion) to get a rough idea of the potential 
number of clusters in the dataset. Both the dendrogram (Figure [9]) and the 



total within class variance evolution (Figure 10 1 tend to indicate a rather small 
number of clusters, around 12. As the visual representation obtained via the 
SOM is more readable than a randomly arranged set of prototypes, we can use 
a slightly oversized SOM without impairing the analyst's work. We decided 
therefore to use a 4 x 5 grid: the rectangular shape has been preferred over a 



Data arc available at http://www.lsp.ups-tlse.fr/staph/npfda/npfda-datasets.html 



20 



o 




Figure 9: Dendrogram of the hierarchical clustering for the Topex/Poseidon radar satellite 
dataset 
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Figure 10: Total within class variance decrease for the 20 first merges in the dendrogram 
depicted by Figure [9] 
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square one according to results from [M] that show some superiority in topology 
preservation for anisotropic maps compared to isotropic ones. 




Figure 11: Prototypes obtained by SOM algorithm on the Topex/Poseidon radar satellite 
dataset (grey curves are the original curves as assigned to the clusters) 

The results of the SOM are given in Figure [TT] that displays the prototype 
of each cluster arranged on the SOM grid. Each cell contains also all the curves 
assigned to the associated clusters. As expected, the SOM prototypes are well 
organized on the grid: the horizontal axis seems to encode approximately the 
position of the maximum of the prototype while the vertical axis corresponds 
to the width of the peak. However, the prototypes are very noisy and inferring 
the general behavior of the curves of a cluster remains difficult. 
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Figure 12: Summarized prototypes obtained by Algorithm |4] on the Topex/Poseidon radar 
satellite dataset (grey curves are the original curves as assigned to the clusters) 
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Figure |12| represents the results obtained after applying Algorithm [4] to the 
results of the SOM. In order to obtain a readable summary, we set P to 80, 
i.e., an average of 4 segments for each cluster. We use the linear interpolation 
approach described in Section [2~5l Contrarily to results obtained on the Tecator 
dataset and reported in Section |3.4| Algorithm [4] used a significant number 
of iterations (17). Once the new clusters and the simplified prototypes were 
obtained, they were displayed in the same way as for the SOM results. The 
resulting map is almost as well organized as the original one (with the exception 
of a large peak on the bottom line), but the simplified prototypes are much more 
informative than the original ones: they give an immediate idea of the general 
behavior of the curves in the corresponding cluster. For example, the differences 
between the two central prototypes of the second line (starting from the top) 
are more obvious with summaries: in the cluster on the left, one can identify 
a slow increase followed by a two phases slow decrease, while the other cluster 
corresponds to a sharp increase followed by a slow decrease. Those differences 
are not as obvious on the original map. 

4.2. Electric power consumption 

The second datasel]^] consists in the electric power consumption recorded in 
a personal home during almost one year (349 days). Each curve consists in 
144 measures which give the power consumption of one day at a 10 minutes 



sampling rate. Figure 13 displays 20 randomly chosen curves. 
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Figure 13: 20 electric power consumption curves 

We analyse the curves in a very similar way as for the Topex/Poscidon 
dataset The results of the hierarchical clustering displayed by Figures [14] and 



This dataset is available at http : //bilab . enst . f r/wakka.php?wiki=HomeLoadCurve 
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[15] lead to the same conclusion as in the previous case: the number of clusters 
seems to be small, around 13. As before, we fit a slightly larger SOM (4x5 
rectangular grid) to the data and obtain the representation provided by Figure 
p~6] The map is well organized: the vertical axis seems to encode the global 
power consumption while the horizontal axis represents the overall shape of 
the load curve. However, the shapes of the prototypes are again not easy to 
interpret, mainly because of their noisy nature. 




Figure 14: Dendrogram of the hierarchical clustering for the load curves 



□□□□□□□□ □ □□□□ 



□ 



9 10 11 12 13 14 15 16 17 18 19 20 



Figure 15: Total within class variance decrease for the 20 first merges in the dendrogram 
depicted by Figure [T4] 

Figure [IT] shows the results obtained after applying Algorithm [4] to the re- 
sults of the SOM (Algorithm[4]used 9 iterations to reach a stable state). In order 
to obtain a readable summary, we set P to 80, i.e., an average of 4 segments 
for each cluster. We depart from the previous analysis by using here a piece- 
wise constant summary: this is more adapted to load curves as they should be 
fairly constant on significant time periods, because many electrical appliances 
have stable power consumption once switched on. As for the Topex/Poseidon 
dataset, the summarized prototypes are much easier to analyze than their noisy 
counterpart. Even the global organization of the map is more obvious: the top 
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Figure 16: Prototypes obtained by SOM algorithm for the load curves (grey curves arc the 
original curves as assigned to the clusters) 
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right of the map for instance, gathers days in which the power consumption 
in the morning (starting at 7 am) is significant, is followed by a period of low 
consumption and then again by a consumption peak starting approximately 
at 7 pm. Those days are week days in which the home remains empty during 
work hours. Other typical clusters include holidays with almost no consumption 
(second cluster on the first line), more constant days which correspond to week 
ends, etc. In addition, the optimal resource assignment tends to emphasize the 
difference between the clusters by allocating less resources to simple ones (as 
expected). This is quite obvious in the case of piecewise constant summaries 



used in Figure 17 (and to a lesser extent on Figure 12 ) 




Figure 17: Summarized prototypes (P = 80) obtained by Algorithm [4] for the load curves 
(grey curves are the original curves as assigned to the clusters) 
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In order to emphasize the importance of a drastic reduction in prototype 
complexity, Algorithm [4] was applied to the same dataset, using the same initial 
configuration, with P = 160, i.e., an average of 8 segments in each cluster. 
Results are represented on Figure[l8] As expected, prototypes are more complex 




Figure 18: Summarized prototypes (P = 160) obtained by Algorithm [5] for the load curves 
(grey curves are the original curves as assigned to the clusters) 



than on Figure 17 While they are not as rough as the original ones (see Figure 
16), their analysis is more difficult than the one of prototypes obtained with 
stronger summarizing constraints. In particular, some consumption peaks are 
no more roughly outlined as in Figure [17] but more accurately approximated 
(see for instance the last cluster of the second row). This leads to series of 
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short segments from which expert inference is not easy. It should be noted in 
addition, that because of the noisy nature of the data, increasing the number 
of segments improves only marginally the approximation quality. To show this, 
we first introduce a relative approximation error measure given by 

J2k=l 2~«eG fe Spii Z)t,GC*( S *(*j) — Mfc,p) 2 

Li=i2^i=i( s i(*j) - Mi) 

using notations from Section |3.2| and where fii denotes the mean of Sj. This rel- 
ative error compares the total squared approximation error to the total squared 
internal variability inside the dataset. Table [T] gives the dependencies between 
the number of segments and the relative approximation error. It is quite clear 
that the relative loss in approximation precision is acceptable even with strong 
summarizing constraints. This type of numerical quality assessment should be 

Summary No summary P = 160 P = 80 

Relative approximation error 0.676 0.696 0.740 

Table 1: Approximation errors for the load curves 

provided to the analyst either as a way to control the relevance of the summa- 
rized prototypes, or as a selection guide for choosing the value of P. 



5. Conclusion 

We have proposed in this paper a new exploratory analysis method for func- 
tional data. This method relies on simple approximations of functions, that 
is on models that are piecewise constant or piecewise linear. Given a set of 
homogeneous functions, a dynamic programming approach computes efficiently 
and optimally a simple model via a segmentation of the interval on which the 
functions are evaluated. Given several groups of homogeneous functions and a 
total number of segments, another dynamic programming algorithm allocates 
optimally to each cluster the number of segments used to build the summary. 
Finally, those two algorithms can be embedded in any classical prototype based 
algorithm (such as the K means) to refine the groups of functions given the 
optimal summaries and vice versa in an alternating optimization scheme. 

The general framework is extremely flexible and accommodates different 
types of quality measures (L2, L±), aggregation strategies (maximal error, mean 
error), approximation model (piecewise constant, linear interpolation) and clus- 
tering strategies (K means, Neural gas). Experiments show that it provides 
meaningful summaries that help greatly the analyst in getting quickly a good 
idea of the general behavior of a set of curves. 

While we have listed many variants of the method to illustrate its flexibility, 
numerous were not mentioned as they are more distant from the main topic of 
this paper. It should be noted for instance that rather than building a unique 
summary for a set of curves, one could look for a unique segmentation support- 
ing curve by curve summaries. In other words, each curve will be summarized 
by e.g., a piecewise constant model, but all models will share the same segmen- 
tation. This strategy is very interesting for analyzing curves in which internal 
changes are more important than the actual values taken by the curves. It is 
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closely related to the so called best basis problem in which a functional basis is 
built to represent efficiently a set of curves (see e.g., HU |3T] ) . This is also 
related to variable clustering methods (see e.g., [H]). This final link opens an 
interesting perspective for the proposed method. It has been shown recently 
[TOl |2"U] that variable clustering methods can be adapted to supervised learning: 
variables are grouped in a way that preserves as much as possible the explana- 
tory power of the reduced set of variables with respect to a target variable. It 
would be extremely useful to provide the analyst with such a tool for exploring a 
set of curves: she would be able to select an explanatory variable and to obtain 
summaries of those curves that mask details that are not relevant for predicting 
the chosen variable. 
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